Abstract


Writing a good essay typically involves students revising an 
initial paper draft after receiving feedback. We present eRe- 
vise, a web-based writing and revising environment that uses 
natural language processing features generated for rubric- 
based essay scoring to trigger formative feedback messages 
regarding students’ use of evidence in response-to-text writ- 
ing. By helping students understand the criteria for using text 
evidence during writing, eRevise empowers students to bet- 
ter revise their paper drafts. In a pilot deployment of eRevise 
in 7 classrooms spanning grades 5 and 6, the quality of text 
evidence usage in writing improved after students received 
formative feedback and then engaged in paper revision.


Introduction 


With benefits such as minimizing human effort and assur- 
ing scoring consistency, natural language processing (NLP) 
has been used to develop many Automatic Essay Scoring 
(AES) systems that can reliably assess the content, structure, 
and quality of written prose (Shermis and Burstein 2003; 
2013). However, before providing students with final essay 
scores, engaging students in cycles of essay drafting and 
revising after feedback is also essential (Graham, Harris, 
and Santangelo 2015). This is because scoring without feed- 
back is typically not enough to guide students on how to 
revise an essay for further improvement. Standalone AES 
systems are thus increasingly being incorporated into Au- 
tomated Writing Evaluation (AWE) systems (Dikli 2006; 
Roscoe et al. 2014; Weigle 2013), which provide students 
with formative feedback in addition to (or instead of) essay 
scores. Formative feedback can guide students during revi- 
sion in ways that help students compensate for identified 
essay weaknesses. Although Foltz and Rosenstein (2015) 
showed that student writing could improve with revisions 
based on AWE feedback and Chapelle, Cotos, and Lee 
(2015) showed that successful revising is related to feedback 
accuracy, much AWE research remains to be done. 

This paper presents the design and first classroom eval- 
uation of eRevise, an AWE system for improving students’ 
ability to use text evidence — a dimension of writing that is 
important for college and career readiness. eRevise has been 
designed for students in grades 5-6 taking the Response to
Text Assessment (RTA) (Correnti et al. 2013), where stu- 
dents first write an essay in response to a source text and 
are then assessed using a detailed Evidence scoring rubric.¹
In particular, eRevise has been developed for deployment 
in a formative/classroom environment over two class pe- 
riods (in contrast to a summative/high-stakes usage). Stu- 
dents write their essays in the first period, then revise their 
essays in the second period after receiving formative feed- 
back automatically selected based on first draft AES. In con- 
trast to many AES systems that achieve high scoring relia- 
bility but do not address construct validity (Condon 2013; 
Perelman 2012), eRevise uses a rubric-based AES system to 
ensure that dimensions of the construct are well represented 
by the indicators used for scoring (Loukina et al. 2015). This 
in turn enables the development of an AWE algorithm for 
converting internal AES data structures into formative feed- 
back messages that are both tailored to each student’s writ- 
ing needs and aligned to the constructs of the scoring rubric. 
eRevise is also notable in focusing on evidence usage rather 
than on surface writing features, and on upper elementary 
rather than middle or high school students, which makes the 
application of NLP techniques particularly challenging. 

The next two sections describe the eRevise workflow, 
and the NLP techniques supporting eRevise’s AES and 
AWE components, respectively. This is followed by a class- 
room deployment and evaluation section demonstrating the 
promise of eRevise in supporting essay revision. Our de- 
ployment tested whether eRevise helped students: 1) im- 
prove the overall quality of their drafts when evaluated by 
human scorers using the RTA evidence rubric, and 2) increase
the quantity and relevance/specificity of their text evidence
usage when evaluated using NLP. Analyses of 143
essays created by 5th and 6th grade students from 7 differ- 
ent classes support both hypotheses. 


System Usage and Architecture 


During the first class, a teacher reads an article aloud (with 
students following along on a hardcopy) about an effort by 


¹Although eRevise currently focuses only on the Evidence
rubric, the full RTA as scored by humans comprises five di- 
mensions: Analysis, Evidence, Organization, Academic Style, and 
MUGS (Mechanics, Usage, Grammar, Spelling). 
Figure 1: The architecture of the eRevise system.


the Millennium Villages Project (MVP) to eradicate poverty
in a Kenyan village.² After the teacher discusses predefined
vocabulary and asks standardized questions at designated 
points, there is a prompt at the end of the article which asks 
students: “Based on the article, did the author provide a con- 
vincing argument that winning the fight against poverty is 
achievable in our lifetime? Explain why or why not with 3- 
4 examples from the text to support your answer.” At this 
point students use eRevise to write an essay in response to 
the prompt. Both the source article and the prompt appear on 
the screen, with students typing their drafts into a text area 
below the prompt. The purpose of this first usage of eRevise 
is to electronically collect students’ first drafts. 

Figure 1 shows the architecture of eRevise. After stu- 
dents submit their first drafts, eRevise’s AES component 
uses a previously developed RTA Evidence scoring algo- 
rithm (Zhang and Litman 2017) to extract features repre- 
senting the quality of text-based evidence usage in terms of 
constructs in the RTA Evidence rubric. Some of these fea- 
tures (described in the next section) are then passed as input 
to the AWE system’s feedback selection algorithm, which 
will in turn output a subset of predefined feedback messages 
that are believed to best address the problems of the first 
draft based on the features. These formative feedback mes- 
sages (although not the AES Evidence scores themselves) 
will be shown to students during the second class period. 

During the second class period, students log in to eRevise
and revise their first drafts. Figure 2
shows a screenshot illustrating revision guided by formative 
feedback. The left top box shows a student’s first draft. This 
helps students to recall their first drafts and eases revising 
(e.g., by allowing cutting and pasting). The right-hand side 
of the screen shows the feedback on the first draft that was 
automatically selected by the AWE system. The left bottom 
box shows where students create their second drafts,
hopefully guided by the feedback displayed on the right.³


²While the RTA has three forms (i.e., articles), eRevise currently
supports AES only for RTA_MVP.

³After a student clicks submit, the AES system also scores the
revised version of the student’s essay. Although eRevise does not 
share AES scores with students (due to its focus on formative feed- 
back rather than summative assessment), AES scores are included 
Essay Analysis and Feedback Selection


The ultimate goal of our research is to dynamically generate 
formative feedback that incorporates excerpts from students’ 
essays. However, to simplify the first version of eRevise, the 
current AWE system instead dynamically selects two of four 
pre-defined feedback messages to guide students in revis- 
ing first drafts. Table 1 shows these messages, ordered by a 
typical progression of evidence usage in student writing de- 
velopment. The messages reflect research on effective feed- 
back and conceptual frameworks for effective text evidence 
use (Wang, Matsumura, and Correnti 2018), and were cre- 
ated by content experts after an analysis of previously scored 
RTA essays (Correnti et al. 2013; Rahimi et al. 2014; 2017; 
Zhang and Litman 2017). These 4 messages were then or- 
ganized into groups of two based on the appropriateness of 
the messages for essays with differing evidence sophistication
(messages 1 and 2 for the least sophisticated, messages
2 and 3 for more sophisticated, and messages 3 and 4 for 
the most sophisticated essays). Based on AES feature anal- 
ysis of evidence usage in the first draft (and further feature 
processing, described below), each student thus receives two 
feedback messages based on the group assigned to the essay. 


AES Feature Extraction


We have developed several AES systems for RTA assess- 
ment (Rahimi et al. 2017; Zhang and Litman 2017; 2018). 
Our first model (denoted by Rubric) (Rahimi et al. 2017) 
used NLP to represent an essay in terms of features that 
largely correspond to cells in the RTA Evidence rubric. This 
rubric, as well as the correspondence between the rubric and 
features that serve as input to the scoring model, are shown 
in Table 2. A subsequent model (denoted by SG) (Zhang and 
Litman 2017) introduced skip-gram word embeddings into 
the feature extraction process, in order to increase robust- 
ness by moving from lexical to semantic similarity compu- 
tation. Most recently, Zhang and Litman (2018) developed a 
neural network model with a co-attention layer (denoted by 
CO-ATTN) to eliminate human feature engineering. Table 
3 shows performance figures for each of these AES mod- 
els when evaluated using cross-validation on a previously 
collected RTA_MVP corpus of 2970 essays. Although the
neural CO-ATTN model has the best performance, to select 
formative feedback messages that address essay weaknesses 
in terms of rubric constructs, a more interpretable represen- 
tation of the essay is necessary. Therefore, SG is the AES 
system used in eRevise. In particular, two of the features 
used by SG for score prediction, namely Number of Pieces 
of Evidence (NPE) and Specificity (SPC), form the basis of 
eRevise’s feedback selection algorithm.⁴ Table 4 shows an
example first and second draft (with AES Evidence scores 
of 2 and 3, respectively), along with NPE and SPC values. 
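
The Quadratic Weighted Kappa (QWK) figures reported for the AES models measure agreement between predicted and human scores while penalizing larger disagreements more heavily. The following is a minimal sketch of the standard QWK computation, assuming integer scores on the 1-4 Evidence scale; the function name and its parameters are illustrative and are not taken from the eRevise code base.

```python
from typing import Sequence


def quadratic_weighted_kappa(human: Sequence[int], predicted: Sequence[int],
                             min_score: int = 1, max_score: int = 4) -> float:
    """Standard QWK between two lists of integer scores (illustrative sketch)."""
    if len(human) != len(predicted) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    n = len(human)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for h, p in zip(human, predicted):
        observed[h - min_score][p - min_score] += 1

    # Marginal histograms, used for the chance-expected matrix.
    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            expected = hist_h[i] * hist_p[j] / n
            num += weight * observed[i][j]
            den += weight * expected

    # If both raters give one identical score to every essay, agreement is perfect.
    return 1.0 if den == 0 else 1.0 - num / den
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.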
Number of Pieces of Evidence (NPE) is an integer en- 
coding the number of evidence topics in the article that are 


in summary reports later shared with teachers. 

⁴Although Concentration (CON) is also aligned with the rubric,
the other two features are more aligned with the feedback and they 
are more consequential for improving evidence usage. 


Prompt: Consider the reasons given in the article for why we should and should not fund space exploration. Did the author convince
you that “space exploration is desirable when there is so much that needs to be done on earth”? Give reasons for your answer.
Support your reasons with 3-4 pieces of evidence from the text.


First draft of your essay below

In the story “A Brighter Future” yes the author convince me that winning the fight against poverty is achievable in our lifetime. People need a
home just like animals need a home ,and its just not fair that we have homes and they don't. It says that “The plan is to get people out of
poverty, assure them access to health care and help them stabilize the economy and quality of life in their communities.” Another one is
“Villages get technical advice and practical items,such as fertilizer,medicine and school supplies.”Also it says that “It is hard for me to see
people sick with preventable diseases, people who are near death when they shouldn't have to be.I just get scared and sad.” These 3 examples
mean that people don't need to go through poverty , it’s sad and scary for them and they don't need to go through this.


Revise your essay below (You can copy and paste your original essay into the text box below and revise it.)


MAKE YOUR ESSAY MORE CONVINCING (Help readers understand why you believe the fight against poverty is/isn’t
achievable in our lifetime by following the suggestions in the two boxes below.)


Use more evidence from the article

• Re-read the article and the writing prompt.
• Choose at least three different pieces of evidence to support your argument.
• Consider the whole article as you select your evidence.


Provide more details for each piece of evidence you use

• Add more specific details about each piece of evidence.
  ◦ For example, writing, “The school fee was a problem” is not specific enough. It is better to
    write, “Students could not attend school because they did not have enough money to pay the
    school fee.”
• Use your own words to describe the evidence.

Figure 2: A formative feedback screenshot of the eRevise system. 


No. | Name | Feedback
1 | Use more evidence from the article | • Re-read the article and the writing prompt. • Choose at least three different pieces of evidence to support your argument. • Consider the whole article as you select your evidence.
2 | Provide more details for each piece of evidence you use | • Add more specific details about each piece of evidence. For example, writing, “The school fee was a problem” is not specific enough. It is better to write, “Students could not attend school because they did not have enough money to pay the school fee.” • Use your own words to describe the evidence.
3 | Explain the evidence | • Tell your reader why you included each piece of evidence. Explain how the evidence helps to make your point.
4 | Explain how the evidence connects to the main idea & elaborate | • Tie the evidence not only to the point you are making within a paragraph, but to your overall argument. • Elaborate. Give a detailed and clear explanation of how the evidence supports your argument.


Table 1: Four feedback messages predefined by content experts, based on progression of evidence use. 


mentioned in the essay. A fixed-size sliding window algorithm
is used to extract this feature. If a window⁵ contains
at least two similar words from a manually crafted list of
main topics and associated words from the article⁶, the window
is determined to contain text-based evidence related to
the topic. Word embedding is used to calculate word simi- 
larity, with two words considered as similar after threshold- 
ing, thus enabling both lexical and semantic matching (e.g., 
a student’s use of “power” will match “electricity” in the 
article). In Table 4, the NPE features indicate that the stu- 
dent used text evidence from more topics after revision, i.e., 
AES identifies one topic (Hospital) in the first draft versus 
three (Hospital, Farming, and Malaria) in the revised draft - 
although Malaria is actually a false positive. 
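
The window-based NPE extraction described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the eRevise implementation: the topic word lists in `TOPIC_WORDS` are hypothetical, and `similar` uses exact matching plus a tiny hand-written synonym set as a stand-in for the thresholded word-embedding similarity used by the actual system.

```python
import re
from typing import Callable, Dict, List, Set

WINDOW_SIZE = 6  # window size reported for eRevise

# Hypothetical topic word lists; the real lists were manually crafted by the authors.
TOPIC_WORDS: Dict[str, Set[str]] = {
    "Hospital": {"hospital", "doctor", "medicine", "electricity"},
    "Farming": {"crops", "fertilizer", "irrigation"},
}

# Stand-in for thresholded embedding similarity (assumption for illustration).
_SYNONYMS = {frozenset(("power", "electricity"))}


def similar(a: str, b: str) -> bool:
    return a == b or frozenset((a, b)) in _SYNONYMS


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def number_of_pieces_of_evidence(
    essay: str,
    topic_words: Dict[str, Set[str]] = TOPIC_WORDS,
    is_similar: Callable[[str, str], bool] = similar,
    window_size: int = WINDOW_SIZE,
    min_matches: int = 2,
) -> int:
    """Count topics for which some window matches at least `min_matches` list words."""
    tokens = tokenize(essay)
    found: Set[str] = set()
    for start in range(max(1, len(tokens) - window_size + 1)):
        window = tokens[start:start + window_size]
        for topic, words in topic_words.items():
            matched = {w for w in words if any(is_similar(t, w) for t in window)}
            if len(matched) >= min_matches:
                found.add(topic)
    return len(found)
```

With these toy lists, a sentence such as "The hospital had no power or medicine" counts as evidence for the Hospital topic because "hospital" matches exactly and "power" matches "electricity" through the similarity stand-in.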


⁵In eRevise, all windows are of size 6, which optimized AES
performance on previously scored training essays. 

⁶For the RTA article, the 4 topics are Hospital, Malaria, Farming,
and School (Rahimi et al. 2017). Computing similarity with
pre-defined topics is typical in content-based AES (Liu et al. 2014). 
Specificity (SPC) is a vector of integers that encodes the
number of specific article examples mentioned in the es- 
say. The length of this vector is the number of manually 
crafted categories, which is 8 for the RTA article (Rahimi 
et al. 2017). A window-based algorithm is again used for 
feature extraction, now using a different manually crafted 
list of words associated with examples and categories from 
the article, where all examples are assigned to different categories.
For example, in the sliding window “see people sick
with preventable diseases”, the essay words “preventable”
and “diseases” match the article word list (“malaria common
disease preventable treatable”) for one of the 6 examples
associated with SPC category 4⁷. Therefore, the algorithm
increments the value of SPC4 by 1.


AWE Feedback Selection 


Although the SPC values (which count the number of times 
the student mentions specific examples from the article) 
were useful for developing the AES system via supervised 


⁷This category talks about malaria before the MVP program.


Criterion | 1 | 2 | 3 | 4
Number of Pieces of Evidence | Features one or no pieces of evidence (NPE) | Features at least 2 pieces of evidence (NPE) | Features at least 3 pieces of evidence (NPE) | Features at least 3 pieces of evidence (NPE)
Relevance of Evidence | Selects inappropriate or irrelevant details from the text to support key idea (SPC); references to text feature serious factual errors or omissions | Selects some appropriate and relevant evidence to support key idea, or evidence is provided for some ideas, but not actually the key idea (SPC); evidence may contain a factual error or omission | Selects pieces of evidence from the text that are appropriate and relevant to key idea (SPC) | Selects evidence from the text that clearly and effectively supports key idea
Specificity of Evidence | Provides general or cursory evidence from the text (SPC) | Provides general or cursory evidence from the text (SPC) | Provides specific evidence from the text (SPC) | Provides pieces of evidence that are detailed and specific (SPC)
Elaboration of Evidence | Evidence may be listed in a sentence (CON) | Evidence provided may be listed in a sentence, not expanded upon (CON) | Attempts to elaborate upon evidence (CON) | Evidence must be used to support key idea / inference(s)
Plagiarism | Summarize entire text or copies heavily from text (in these cases, the response automatically receives a 1) | | |


Table 2: Rubric for scoring the Evidence dimension of RTA. The abbreviations in the parentheses identify features used by the 
AES system that are aligned with specific assessment criteria (Rahimi et al. 2017). 


AES Model | QWK
Rubric | 0.632
SG | 0.653
CO-ATTN | 0.697


Table 3: Quadratic Weighted Kappa (QWK) of different 
AES models. The CO-ATTN model significantly outper- 
forms the Rubric and SG models, respectively (p < 0.05). 


machine learning, we found them to be less useful for de- 
veloping a feedback selection algorithm because the count 
included duplicate cases, and because the use of word- 
embedding meant false positive examples were identified 
during AES. The AWE system thus calculates a new feature
named SPC_Total_Merged, which is a count of the
number of non-duplicate, unique article examples from the 
SPC feature vector. For example, in the sentence “for me to
see people sick with preventable diseases”, the algorithm
(window size 6) first finds one example (matched words
“people” and “sick”) and then a second (matched words
“preventable” and “diseases”). While the SPC feature
considers these as 2 examples, SPC_Total_Merged
considers them as 1 unique
example. For the first draft in Table 4, the algorithm thus re- 
duces the SPC total of 11 (from AES, equation below) to a 
smaller merged total of 6 (for AWE). 

After extracting the above features for our previously col- 
lected corpora of scored essays, AWE feedback selection 
was guided by three assumptions: 1) the NPE feature indi- 
cates the breadth of unique topics, 2) the SPC_Total_Merged 
feature indicates the number of unique pieces of evidence 
the student located and used; and 3) a matrix of these two 
indicators could match each essay to appropriate feedback. 
Given that we did not need to modify NPE, the following equations
were used to calculate SPC_AWE for feedback selection.


$$\mathit{SPC}_{total} = \sum_{i=1}^{N} \mathit{SPC}_i \tag{1}$$

SPC_total, where N is the number of categories in SPC, calculates
the total number of matches the computer finds between
students’ first drafts and examples we are looking for.

$$\mathit{SPC}_{important} = \sum_{i=S}^{E} \mathit{SPC}_i \tag{2}$$

SPC_important, where S and E are the start and end indices
of important categories, calculates the total number
of matched examples from four primary topics for evidence
usage (hospital, malaria, farming and school). In the
RTA_MVP, content experts identified these categories as primary
because they are the topics on which students can provide
specific examples of improvement, while other SPC
categories refer to more general examples from the article.

$$\mathit{DR} = \frac{\mathit{SPC}_{total} - \mathit{SPC\_Total\_Merged}}{\mathit{SPC}_{total}} \tag{3}$$

DR calculates the duplication rate of matched examples, by
using SPC_Total_Merged to calculate the proportion of
duplicate evidence from SPC_total.

$$\mathit{SPC}_{AWE} = \mathrm{RND}\big(\mathit{SPC}_{important} \times (1 - \mathit{DR})\big) \tag{4}$$

SPC_AWE adjusts the number of important matched examples
by the duplication rate. This produces a new score
for generating feedback, representing the number of unique
matched examples from four primary topics. We round the
number to get an integer used in the conditional statement
below.

$$\mathit{SPC}_{lmh} = \begin{cases} L, & \text{if } \mathit{SPC}_{AWE} < 3 \\ M, & \text{if } \mathit{SPC}_{AWE} \geq 3 \text{ and } \mathit{SPC}_{AWE} < 5 \\ H, & \text{otherwise} \end{cases} \tag{5}$$

SPC_lmh is a categorical variable for SPC_AWE that indicates
low (L), medium (M), or high (H) values.

Finally, the AWE system uses NPE (computed during
AES) and SPC_lmh to select the two most appropriate feedback
messages for the essay based on Table 5. The content
experts used a previously scored corpus (Zhang and Litman 
2017) as development data to manually design this table. 

For the first draft in Table 4, NPE = 1, SPC_total = 11,
SPC_Total_Merged = 6, SPC_important = 6, DR =
0.455, SPC_AWE = 3, and SPC_lmh = M. Therefore, after


First Draft 


Essay 


In the story “A Brighter Future”,yes the author convince me that winning the fight against poverty 
is achievable in our lifetime. People need a home just like animals need a home ,and its just not fair 
that we have homes and they don’t. It says that “The plan is to get people out of poverty, assure them 
access to health care and help them stabilize the economy and quality of life in their communities.” 
Another one is “Villages get technical advice and practical items,such as fertilizer,medicine and 
school supplies.”Also it says that “It is hard for me to see people sick with preventable diseases, 
people who are near death when they shouldn’t have to be.I just get scared and sad.” These 3 
examples mean that people don’t need to go through poverty , it’s sad and scary for them and they 
don’t need to go through this. 


Features 


NPE | SPC1 | SPC2 | SPC3 | SPC4 | SPC5 | SPC6 | SPC7 | SPC8 | SPC_Total_Merged
1   | 1    | 2    | 1    | 3    | 0    | 1    | 1    | 2    | 6


Second Draft 


Essay 


In the story “A Brighter Future” yes the author convince me that winning the fight against poverty 
is achievable in our lifetime. Yes we need to win the fight against poverty because everybody needs 
a home, shelter, food,and money. It say that “Their crops were dying because they could not 
afford the necessary fertilizer and irrigation” Another one is that “Its hard for me to see people sick 
with preventable diseases,people who are near death when they shouldn’t have to be.” Also “.Little 
kids were wrapped in cloth on their mothers backs,or running around in bare feet and tattered 
clothing.” These three examples mean that we need to help them have a better life and a better 
home than the busty,dirty ground. 


Features 


NPE | SPC1 | SPC2 | SPC3 | SPC4 | SPC5 | SPC6 | SPC7 | SPC8 | SPC_Total_Merged
3   | 2    | 1    | 2    | 3    | 2    | 1    | 0    | 1    | 6


Table 4: Examples of a student’s first and second essay drafts, showing the NLP analyses during AES that are needed for AWE. 
For each essay, the text-based evidence identified during AES that is used to compute the essay’s SPC values is shown in italics 
(and in bold if only identified in the second draft). eRevise would display feedback messages 1 and 2 for the first essay draft.


Feature           | Value
NPE               | 0   | 0   | 0   | 1   | 2   | 3   | 4   | 1   | 1   | 2   | 2   | 3   | 4   | 3   | 4
SPC_lmh           | L   | M   | H   | L   | L   | L   | L   | M   | H   | H   | M   | M   | M   | H   | H
Feedback Messages | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 1,2 | 2,3 | 2,3 | 2,3 | 3,4 | 3,4


Table 5: Lookup table for feedback selection. 


consulting Table 5, eRevise would display feedback messages
1 and 2 for this essay (as for the essay displayed in
Figure 2). 
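
The selection procedure described in this section (Equations 1-5 followed by the Table 5 lookup) can be sketched end to end as follows. This is an illustrative sketch, not the eRevise source: function names are hypothetical, SPC_important is passed in directly because the indices S and E are not given in the text, RND is assumed to round half up, and the thresholds follow our reading of Equation 5, with the lower bound of M taken as inclusive because the worked example maps SPC_AWE = 3 to M.

```python
import math
from typing import Dict, List, Tuple

# Table 5 as a lookup from (NPE, SPC_lmh) to the pair of feedback messages.
FEEDBACK_TABLE: Dict[Tuple[int, str], Tuple[int, int]] = {}
for key in [(0, "L"), (0, "M"), (0, "H"), (1, "L"), (2, "L"),
            (3, "L"), (4, "L"), (1, "M"), (1, "H"), (2, "H")]:
    FEEDBACK_TABLE[key] = (1, 2)
for key in [(2, "M"), (3, "M"), (4, "M")]:
    FEEDBACK_TABLE[key] = (2, 3)
for key in [(3, "H"), (4, "H")]:
    FEEDBACK_TABLE[key] = (3, 4)


def spc_awe(spc: List[int], spc_total_merged: int, spc_important: int) -> int:
    """Equations 1, 3 and 4: adjust the important-example count by the duplication rate."""
    spc_total = sum(spc)                                  # Equation 1
    if spc_total == 0:
        return 0                                          # no matches, so nothing to adjust
    dr = (spc_total - spc_total_merged) / spc_total       # Equation 3
    # Equation 4; rounding half up is an assumption about RND.
    return math.floor(spc_important * (1 - dr) + 0.5)


def spc_level(awe: int) -> str:
    """Equation 5: map SPC_AWE to low, medium, or high."""
    if awe < 3:
        return "L"
    if awe < 5:
        return "M"
    return "H"


def select_feedback(npe: int, spc: List[int],
                    spc_total_merged: int, spc_important: int) -> Tuple[int, int]:
    level = spc_level(spc_awe(spc, spc_total_merged, spc_important))
    return FEEDBACK_TABLE[(npe, level)]
```

For the first draft in Table 4, `select_feedback(1, [1, 2, 1, 3, 0, 1, 1, 2], 6, 6)` reproduces the values given in the text: the duplication rate is 5/11, SPC_AWE rounds to 3, the level is M, and the selected messages are 1 and 2.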

In sum, the AWE process results in all students receiv- 
ing two (of four possible) feedback messages that are se- 
lected based on the AES feature analysis and are thus tar- 
geted to improving the quality of each student’s particular 
essay. Note that students will receive feedback even when 
AES predicts a score of 4 for the first draft. In most cases, 
such students will receive the third and fourth feedback mes- 
sages focusing on evidence elaboration. 


Experimental Deployment and Results 


Our first deployment of eRevise took place in two public 
rural parishes in Louisiana. Seven 5th- and 6th-grade teachers
had all students in one of their English Language Arts
classes write and revise an essay using eRevise. A total of 
143 students completed all tasks. We test two hypotheses: 


H1: eRevise helps students improve the overall quality of 
their drafts, as evaluated by human scorers using the RTA 
evidence rubric. 


H2: eRevise increases the quantity and relevance/specificity⁸
of evidence that students use from the
RTA source text, as evaluated using NLP features.


⁸Corresponding to row 1 and rows 2-3 of Table 2.
The outcome measure for testing H1 is a human-produced
RTA Evidence score. After the deployment, a trained hu- 
man grader used the rubric from Table 2 to score all es- 
says, without knowing whether an essay was a first or sec- 
ond draft. A paired t-test comparing the first and second 
draft Evidence scores (n=143) supports H1, as the scores 
improved from first (MEAN = 2.62, SD = 0.95) to second
(MEAN = 2.72, SD = 0.92) drafts with trending
statistical significance (p < 0.08). The grader also scored
essays for the other four RTA dimensions (recall footnote 
1). In contrast to Evidence (for which eRevise provided for- 
mative feedback to guide revision), there were no signifi- 
cant or trending score improvements for any of these other 
RTA dimensions (all p > .29). Finally, the scatter plot in 
Figure 3a shows that the overall improvement in the Evi- 
dence dimension was observed despite potential ceiling ef- 
fects: 28 students received the maximum score of 4 on their 
first drafts, 16 of whom also received the maximum on their 
second drafts. The plot also shows that although the scores 
increased for 34 students, the scores did not change for the 
majority of students (and less often even decreased). 
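
The paired t-test used here compares each student's first-draft score with the same student's second-draft score. A minimal sketch of the test statistic is given below; the function name is illustrative, and the scores in the usage note are hypothetical toy values, not data from the deployment.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for a paired t-test on matched score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-student differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

For example, with hypothetical first-draft scores `[2, 3, 1, 4, 2, 3]` and second-draft scores `[3, 3, 2, 4, 2, 4]`, the mean difference is 0.5 and the statistic has 5 degrees of freedom. The p-value would then be read from a t distribution, which this sketch leaves out.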


We thus explore the use of more fine-grained outcome 
measures that have a stronger relationship to the eRevise 
feedback that guided student revision. To test H2, we use 
the NPE and SPC_Total_Merged features as automatically
computed by eRevise during its deployment to approximate

Figure 3: (a) RTA Evidence scores before and after revision. (b) Value changes for the NPE feature. (c) Value changes for the
SPC_Total_Merged feature. (d) RTA Evidence scores for essays receiving feedback messages 1 and 2. (e) RTA Evidence scores
for essays receiving feedback messages 2 and 3. (f) RTA Evidence scores for essays receiving feedback messages 3 and 4.

evidence quantity and relevance/specificity, respectively.
Paired t-tests (n=143) for both support H2. The
NPE feature values improved significantly (p < 0.003) 
from first (MEAN = 2.61, SD = 1.27) to second draft 
(MEAN = 2.81, SD = 1.08). The SPC_Total_Merged 
feature values also improved significantly (p < 0.001) from 
first (MEAN = 9.65, SD = 4.94) to second drafts 
(MEAN = 11.15, SD = 5.39). For NPE, the histogram 
in Figure 3b shows that more students added rather than re- 
moved evidence (30 versus 8 students). Although 105 stu- 
dents showed no evidence change, 43 were already at ceil- 
ing with NPE values of 4 in the first draft. For SPC, the his- 
togram in Figure 3c shows that a large majority of students 
(101) increased the number of specific article examples that 
they incorporated into their essays. 33 other students showed 
no change, while only 9 students removed specific examples. 


Recall the 16 students in Figure 3a who were at ceiling 
when the RTA Evidence score was used as the outcome mea- 
sure. By instead using the SPC_Total_Merged values as 
the outcome, these 16 students can now be seen to show improvement
from their first drafts (MEAN = 12.69, SD =
4.63) to the second drafts (MEAN = 13.25, SD = 5.20),
with trending statistical significance (p < 0.095). 

Finally, Figure 3d shows how evidence scores changed 
for the 45 essays receiving feedback messages 1 and 2. The 
evidence score improvements from first (MEAN = 2.33,
SD = 0.93) to second (MEAN = 2.64, SD = 0.98) 
drafts were statistically significant (p = 0.02). Figure 3e 
shows the score changes for the 27 essays receiving feed- 
back messages 2 and 3. The evidence scores only slightly 
improved from first (MEAN = 2.22, SD = 0.85) to sec- 
ond (MEAN = 2.26, SD = 0.94) drafts. Figure 3f shows 
that for the 71 essays receiving messages 3 and 4, the evidence
scores were almost the same from first (MEAN =
2.94, SD = 0.89) to second (MEAN = 2.94, SD =
0.81) drafts. These three figures suggest that drafts with
the least sophisticated evidence usage had the most room 
for improvement. It is also interesting to relate these three 
feedback-based groupings to essay RTA Evidence scores. 
40.63% of drafts receiving Evidence scores of 1 or 2 received
feedback messages 1 and 2. 62.03% of drafts receiving
Evidence scores of 3 or 4 received messages 3 and 4.
Although only 27 essays received messages 2 and 3, 71.43% 
of these drafts received Evidence scores of 2 or 3. 


Current and Future Directions 


We are about to begin the next deployment of eRevise, 
which will extend our work in two ways. First, to better 
determine the benefit of using AES to adaptively guide re- 
vision, we have added a control condition where eRevise 
will display the same generic feedback message to all stu- 
dents: “MAKE YOUR ESSAY MORE CONVINCING - 
Help readers understand why you believe the fight against 
poverty is/isn’t achievable in our lifetime.” This is in con- 
trast to the existing eRevise adaptive feedback, where stu- 
dents receive different messages based on AES. Second, stu- 
dents will use eRevise for two different forms of the RTA 
(i.e., RTA_Space in addition to RTA_MVP). While we have
already trained an SG model for scoring RTA_Space (Zhang
and Litman 2017), Table 5 needs to be verified to ensure the 
validity of our feedback selection algorithm. We are also ex- 
ploring adding CO-ATTN scores to the lookup table. 

For the longer term, we plan to extend our research in 
other ways. To score a new RTA form, human effort is cur- 
rently necessary to define topical components, e.g., creating 
a list of topics and a list of examples for scoring RTA_Space.


While we have developed pilot data-driven methods that can 
extract such topical components automatically (Rahimi and 
Litman 2016), our methods need to be improved so that they 
do not degrade SG model performance. eRevise will also 
be enhanced to provide feedback for Organization, a second 
substantive RTA writing dimension for which we already 
have a pilot AES (Rahimi et al. 2017). We also plan to move 
from feedback selection to more personalized feedback gen- 
eration, and to create a teacher dashboard which can auto- 
matically generate summaries such as Figure 3a. Finally, 
since eRevise’s feedback encourages students to add more 
concrete examples from the article, some students may sim- 
ply copy and paste examples rather than use their own words 
as the feedback suggests. While the human RTA rubric (last 
row in Table 2) addresses plagiarism, eRevise currently does 
not. We thus plan to incorporate the detection of different 
types of adversarial essays into AES. 


Conclusions 


eRevise is an AWE system for text evidence usage that uses 
NLP features produced by a rubric-based AES system to au- 
tomatically select formative feedback messages most appro- 
priate to a student’s needs. By increasing access to feedback 
on a substantive and important writing dimension, eRevise 
has the potential to reduce demands on teachers and to build 
students’ knowledge of effective text evidence usage. We 
first described how eRevise uses NLP techniques to evalu- 
ate draft essays and to select appropriate formative feedback 
messages to guide later revision. Experimental results from a 
first deployment in 7 classrooms showed that eRevise helped 
students improve their text evidence usage after receiving 
formative feedback and engaging in essay revision. 